Chinese NER Using CRFs and Logic for the Fourth SIGHAN Bakeoff

Authors

  • Xiaofeng Yu
  • Wai Lam
  • Shing-Kit Chan
  • Yiu Kei Wu
  • Bo Chen
Abstract

We report a high-performance Chinese NER system that incorporates Conditional Random Fields (CRFs) and first-order logic for the fourth SIGHAN Chinese language processing bakeoff (SIGHAN-6). Using current state-of-the-art CRFs along with a set of well-engineered features for Chinese NER as the base model, we account for distinct linguistic characteristics of Chinese named entities by introducing various types of domain knowledge into Markov Logic Networks (MLNs), an effective combination of first-order logic and probabilistic graphical models, for validation and error correction of entities. Our submitted results achieved consistently high performance, including first place on the CityU open track and fourth place on the MSRA open track, which shows both the attractiveness and effectiveness of our proposed model.

Similar resources

An Improved CRF based Chinese Language Processing System for SIGHAN Bakeoff 2007

This paper describes three systems: the Chinese word segmentation (WS) system, the named entity recognition (NER) system and the Part-of-Speech tagging (POS) system, which were submitted to the Fourth International Chinese Language Processing Bakeoff. Here, Conditional Random Fields (CRFs) are employed as the primary models. For the WS and NER tracks, the n-gram language model is incorporated in ...

CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data

In this paper, we propose a Chinese word segmentation model for micro-blog text. Although Conditional Random Fields (CRFs) models have been used for word segmentation, this is the first time they have been applied to segmentation in the domain of Chinese micro-blogs. Different from the genres of common articles, micro-blog has gradually become a new literary with the development o...

Achilles: NiCT/ATR Chinese Morphological Analyzer for the Fourth Sighan Bakeoff

We created a new Chinese morphological analyzer, Achilles, by integrating rule-based, dictionary-based, and statistical machine learning methods, the last being conditional random fields (CRF). The rule-based method is used to recognize regular expressions: numbers, times and alphabetic strings. The dictionary-based method is used to find in-vocabulary (IV) words, while out-of-vocabulary (OOV) words are detected by the CRF...

Chinese Word Segmentation and Named Entity Recognition by Character Tagging

This paper describes our word segmentation system and named entity recognition (NER) system for participating in the third SIGHAN Bakeoff. Both of them are based on character tagging, but use different tag sets and different features. Evaluation results show that our word segmentation system achieved 93.3% and 94.7% F-score in UPUC and MSRA open tests, and our NER system got 70.84% and 81.32% F...

Cascaded Chinese Weibo Segmentation Based on CRFs

With the development of Web 2.0, processing data on the Internet has become necessary. This paper reports our work on Chinese weibo segmentation in the 2012 CIPS-SIGHAN bakeoff. In order to improve the recognition accuracy of out-of-vocabulary words, we propose a cascaded model which first segments and disambiguates in-vocabulary words, then recovers out-of-vocabulary words from the fragments....

Publication date: 2008